Data: NBA season game logs from 2014 to Nov 18, 2016; NBA season team CSV data from 2006 to 2016.
Packages: dplyr, tidyr, xts, lubridate, qtlcharts, forecast, tseries, leaflet, ggplot2, plotly, dygraphs, viridis, graphics.
This is an exploratory analysis of data collected from the NBA.
These days, many sports clubs use statistical analysis to run their organizations more efficiently. Advances in technology mean that many kinds of data can now be collected from sporting events, and sports data no longer refers only to baseball. Basketball, for instance, also produces a large variety of data. We thought it would be interesting to analyze sports data outside of baseball, which is why we chose the NBA.
First, we wanted to figure out which factors are most influential on the outcomes of games and, by extension, on making the playoffs. Making the playoffs is one of the most important goals of the season and gives a team invaluable experience. We suspected that the key factors for winning a single game and for reaching the playoffs might differ, so we ran two separate analyses to see whether there really is a difference.
The source of the data is http://stats.nba.com/, which provides information on NBA games and teams. The game-level data cover 2014 through Nov 18th, 2016, and the season-average team data cover 2006 through 2016.
# Load packages
library(gdata)
library(dplyr) # Data manipulation
library(reshape) # Data manipulation
library(randomForest) # Data Analysis
library(ROCR) # Data Analysis
library(corrplot) # Correlation plot
library(ggplot2) # Visualization
library(plotly) # Visualization
library(MASS)
library(flux)
library(qtlcharts)
Now that our packages are loaded, let’s read in the data and check its attributes.
Every team has its own winning strategy. For instance, the Golden State Warriors are famous for their high three-point percentage, while the San Antonio Spurs prefer to pass the ball until they get a perfect scoring chance. Still, we assumed there would be common winning factors and wanted to identify them. Using the data, we built a model that predicts the winning rate, and we tested it on data from the latest season.
We spent a long time on data handling, but the presentation time may not be enough to explain all of it, so we will skip the details here. If you want to know what we did, please see the handout.
# nba games training set
nba1415<- read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\nba2014_2015.csv", header=TRUE)
nba1516<- read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\nba2015_2016.csv", header=TRUE)
nba<-rbind(nba1415,nba1516)
names(nba)<-c("team","date","wl","min","points","FGM","FGA","FGP","THPM","THPA",
"THPP","FTM","FTA","FTP","OREB","DREB","REB","AST","STL","BLK","TOV","PF","PM")
levels(nba$team)[32]<-"Los Angeles Clippers"
# nba games test set
wltest<- read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\2016_2017_test.csv", header=TRUE)
names(wltest)<-c("team","match","date","wl","min","points","FGM","FGA","FGP","THPM","THPA",
"THPP","FTM","FTA","FTP","OREB","DREB","REB","AST","STL","BLK","TOV","PF","PM")
levels(wltest$team)[13]<-"Los Angeles Clippers"
# training set
nba$TPA<-nba$FGA-nba$THPA
nba$TPM<-nba$FGM-nba$THPM
nba$TPP<-(nba$TPM/nba$TPA)*100
# test set
wltest$TPA<-wltest$FGA-wltest$THPA
wltest$TPM<-wltest$FGM-wltest$THPM
wltest$TPP<-wltest$TPM/wltest$TPA*100
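Since the source tables report only total field goals and three-pointers, the two-point columns above are derived by subtraction. A quick sanity check on a made-up stat line (illustrative numbers only):

```r
# Two-point stats derived from field-goal totals minus three-pointers
FGA <- 85; FGM <- 40    # all field goal attempts / makes
THPA <- 25; THPM <- 10  # three-point attempts / makes
TPA <- FGA - THPA       # 60 two-point attempts
TPM <- FGM - THPM       # 30 two-point makes
TPP <- TPM / TPA * 100  # 50 percent
```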
# remove team, date, min, PM, points, FGP, FGM, FGA
# training set
wl<-nba[,!names(nba) %in% c("points","FGP","FGM","FGA","team","date","min","PM")]
# test set
wltest<-wltest[,!names(wltest) %in% c("points","FGP","FGM","FGA","team","date","min","PM","match")]
# quick view of data
knitr::kable(head(wl,n=3))
| wl | THPM | THPA | THPP | FTM | FTA | FTP | OREB | DREB | REB | AST | STL | BLK | TOV | PF | TPA | TPM | TPP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| W | 11 | 29 | 37.9 | 19 | 24 | 79.2 | 11 | 38 | 49 | 39 | 8 | 6 | 17 | 23 | 80 | 46 | 57.50000 |
| L | 15 | 43 | 34.9 | 22 | 25 | 88.0 | 27 | 34 | 61 | 29 | 10 | 1 | 15 | 26 | 82 | 38 | 46.34146 |
| W | 18 | 35 | 51.4 | 20 | 28 | 71.4 | 11 | 28 | 39 | 37 | 4 | 2 | 9 | 16 | 47 | 33 | 70.21277 |
str(wl)
## 'data.frame': 4920 obs. of 18 variables:
## $ wl : Factor w/ 2 levels "L","W": 2 1 2 2 2 2 2 2 2 1 ...
## $ THPM: int 11 15 18 15 12 11 14 4 8 13 ...
## $ THPA: int 29 43 35 20 28 35 28 15 16 27 ...
## $ THPP: num 37.9 34.9 51.4 75 42.9 31.4 50 26.7 50 48.1 ...
## $ FTM : int 19 22 20 12 22 44 30 19 29 29 ...
## $ FTA : int 24 25 28 16 31 49 34 24 33 35 ...
## $ FTP : num 79.2 88 71.4 75 71 89.8 88.2 79.2 87.9 82.9 ...
## $ OREB: int 11 27 11 3 18 19 11 4 10 7 ...
## $ DREB: int 38 34 28 31 28 31 40 30 29 32 ...
## $ REB : int 49 61 39 34 46 50 51 34 39 39 ...
## $ AST : int 39 29 37 31 22 22 32 35 23 17 ...
## $ STL : int 8 10 4 13 10 6 4 8 9 7 ...
## $ BLK : int 6 1 2 3 6 4 5 1 7 3 ...
## $ TOV : int 17 15 9 13 13 14 16 6 17 14 ...
## $ PF : int 23 26 16 30 25 27 26 23 27 35 ...
## $ TPA : int 80 82 47 64 67 64 58 76 73 63 ...
## $ TPM : int 46 38 33 41 40 30 32 52 41 33 ...
## $ TPP : num 57.5 46.3 70.2 64.1 59.7 ...
These are the variable names and class types, along with the first few observations. In total, there are 4920 observations and 18 variables. A brief description of the variables follows:
| Variable Name | Description | Variable Name | Description |
|---|---|---|---|
| wl | Game Result | OREB | Offensive Rebound |
| TPM | 2 Points Made | DREB | Defensive Rebound |
| TPA | 2 Points Attempted | REB | Total Rebound |
| TPP | 2 Points Percentage | AST | Assist |
| THPM | Three Points Made | STL | Steal |
| THPA | Three Points Attempted | BLK | Block |
| THPP | Three Points Percentage | TOV | Turn Over |
| FTM | Free Throw Made | PF | Personal Foul |
| FTA | Free Throw Attempted | ||
| FTP | Free Throw Percentage |
We used random forest method and correlation matrix to select variables.
set.seed(0704)
wlrf<-randomForest(wl~ .,data=wl,ntree=200,proximity=TRUE)
print(wlrf)
##
## Call:
## randomForest(formula = wl ~ ., data = wl, ntree = 200, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 20.3%
## Confusion matrix:
## L W class.error
## L 1977 483 0.1963415
## W 516 1944 0.2097561
varImpPlot(wlrf)
This is the result of the random forest. The OOB error rate is 20.3%; for predicting game outcomes, roughly 80% accuracy seems reasonable.
TPP and THPP showed the largest importance, so it seems reasonable to say that efficient scoring is the first prerequisite for winning.
DREB and REB were also very important variables: rebounding reflects a team's consistency, and consistency is about making fewer mistakes.
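The importance ranking comes from randomForest's mean decrease in Gini. A minimal sketch of how to extract it, shown here on the built-in iris data rather than our NBA frame:

```r
library(randomForest)

set.seed(0704)
# Fit a small classification forest and pull its Gini-based importance
rf  <- randomForest(Species ~ ., data = iris, ntree = 100)
imp <- importance(rf)  # MeanDecreaseGini per variable
sort(imp[, "MeanDecreaseGini"], decreasing = TRUE)
```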
# test with the actual (2016-17) data
wlPred <- predict(wlrf, newdata=wltest)
wltesttable<-table(wlPred, wltest$wl)
acc<-function(x) sum(diag(x))/sum(x)
err<-function(x) (sum(x)-sum(diag(x)))/sum(x)
list("confusion matrix"=wltesttable,"accuracy"=acc(table(wlPred, wltest$wl)),
"error"=err(table(wlPred, wltest$wl)))## $`confusion matrix`
##
## wlPred L W
## L 166 36
## W 45 175
##
## $accuracy
## [1] 0.8080569
##
## $error
## [1] 0.1919431
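The acc and err helpers are simply diagonal-over-total on a confusion matrix; reproducing the printed accuracy by hand from the counts above:

```r
# Confusion matrix counts copied from the output above
cm <- matrix(c(166, 45, 36, 175), nrow = 2,
             dimnames = list(Predicted = c("L", "W"), Actual = c("L", "W")))
accuracy <- sum(diag(cm)) / sum(cm)  # (166 + 175) / 422 = 0.8080569
error    <- 1 - accuracy             # 0.1919431
```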
This is the result of the random forest applied to the test data. The error rate is 19.2%, about the same as the OOB estimate above, 20.3%.
iplotCorr(wl[,-1], chartOpts=list(cortitle="NBA Season", scattitle="Scatterplot"))
(TPA, TPM), (TPP, TPM), (THPA, THPM), (THPP, THPM), (FTA, FTM), and (REB, DREB) had absolute correlation coefficients above 0.6. In the random forest importance above, TPP and THPP were the most important, so we removed TPM and THPM. DREB showed larger importance than REB, and FTM larger importance than FTA, so we removed REB and FTA.
We first ran glm with the game result, wl, as the dependent variable. If any variables were not significant, we removed them and ran glm again.
Once only significant variables remained, we ran anova to see whether the model could be reduced further: since a larger logistic regression model does not necessarily explain the data better, a smaller model with the same explanatory power is always preferable. After removing such variables we refit the model.
# modeling with glm
wllr<-glm(wl~TPP+THPP+DREB+AST+TOV+FTP+FTM+STL+PF+TPA+BLK+THPA+OREB,family=binomial,data=wl)
# refit with glm, dropping TPA
wllr2<-glm(wl~TPP+THPP+DREB+AST+TOV+FTP+FTM+STL+PF+BLK+THPA+OREB,family=binomial,data=wl)
# AST fails to reject the null hypothesis in glm, so remove it and refit
wllr3<-glm(wl~TPP+THPP+DREB+TOV+FTP+FTM+STL+PF+BLK+THPA+OREB,family=binomial,data=wl)
# THPA fails to reject the null hypothesis in the anova, so remove it and refit
wllr4<-glm(wl~TPP+THPP+DREB+TOV+FTP+FTM+STL+PF+BLK+OREB,family=binomial,data=wl)
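The reduction step compares nested models with a likelihood-ratio (chi-square) test: if dropping a term does not significantly increase the deviance, the smaller model is kept. A minimal sketch of the same test on built-in data (mtcars, not our NBA frame):

```r
# Nested logistic models: does dropping wt lose explanatory power?
full    <- glm(am ~ wt + hp, family = binomial, data = mtcars)
reduced <- glm(am ~ hp,      family = binomial, data = mtcars)
a <- anova(reduced, full, test = "Chisq")
a  # a small p-value says the dropped term (wt) should stay in the model
```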
summary(wllr4)
##
## Call:
## glm(formula = wl ~ TPP + THPP + DREB + TOV + FTP + FTM + STL +
## PF + BLK + OREB, family = binomial, data = wl)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4759 -0.5558 -0.0018 0.5385 3.3925
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -28.511079 0.933406 -30.545 < 2e-16 ***
## TPP 0.237567 0.008768 27.094 < 2e-16 ***
## THPP 0.132376 0.005219 25.365 < 2e-16 ***
## DREB 0.286553 0.010682 26.826 < 2e-16 ***
## TOV -0.203480 0.011548 -17.620 < 2e-16 ***
## FTP 0.027811 0.004276 6.504 7.80e-11 ***
## FTM 0.088723 0.007600 11.674 < 2e-16 ***
## STL 0.273486 0.015607 17.524 < 2e-16 ***
## PF -0.115415 0.010251 -11.259 < 2e-16 ***
## BLK 0.111024 0.016527 6.718 1.84e-11 ***
## OREB 0.166996 0.012138 13.758 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6820.6 on 4919 degrees of freedom
## Residual deviance: 3705.2 on 4909 degrees of freedom
## AIC: 3727.2
##
## Number of Fisher Scoring iterations: 6
The final model includes TPP, THPP, DREB, TOV, FTP, FTM, STL, PF, BLK and OREB as explanatory variables.
\[p=\frac{e^\beta}{1+e^\beta}\] \[\beta = -28.51+0.24·TPP+0.13·THPP+0.29·DREB-0.20·TOV+0.03·FTP\] \[+0.09·FTM+0.27·STL-0.12·PF+0.11·BLK+0.17·OREB\]
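As a worked example, plugging the stat line of the first game in the training table shown earlier into the fitted coefficients:

```r
# First training game: TPP 57.5, THPP 37.9, DREB 38, TOV 17, FTP 79.2,
# FTM 19, STL 8, PF 23, BLK 6, OREB 11
beta <- -28.511079 + 0.237567*57.5 + 0.132376*37.9 + 0.286553*38 -
        0.203480*17 + 0.027811*79.2 + 0.088723*19 + 0.273486*8 -
        0.115415*23 + 0.111024*6 + 0.166996*11
p <- exp(beta) / (1 + exp(beta))  # about 0.97 -- and that game was indeed a win
```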
DREB showed the largest coefficient, followed by STL and TPP.
# predicted probability of winning
wl.lr.prob <- predict(wllr4, newdata = wltest, type = "response")
# Confusion matrix
wl.lr.pred <- ifelse(wl.lr.prob > 0.5, "W", "L")
wl.lr.tab <- table("Actual" = wltest$wl,
"Predicted" = wl.lr.pred)
list("confustion matrix"=wl.lr.tab, accuracy=acc(wl.lr.tab),error=err(wl.lr.tab))## $`confustion matrix`
## Predicted
## Actual L W
## L 159 52
## W 39 172
##
## $accuracy
## [1] 0.7843602
##
## $error
## [1] 0.2156398
The model’s accuracy was 78.4%, so we judged the model worth using.
# auc
wl.lr.pred <- prediction(wl.lr.prob, wltest$wl)
wl.lr.perform <- performance(wl.lr.pred, measure = "tpr", x.measure = "fpr")
plot(wl.lr.perform,main="ROC Curve for Logistic Regression Model")
abline(a=0,b=1,col=2,lwd=3)
wl.lr.auc <- performance(wl.lr.pred, measure = "auc")
wl.lr.auc <- wl.lr.auc@y.values[[1]]
wl.lr.auc
## [1] 0.890591
We drew a Receiver Operating Characteristic (ROC) curve to evaluate the model. The Area Under the Curve (AUC) was 0.89, so the logistic regression model appears to classify the data very well.
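The AUC is the trapezoidal integral of TPR over FPR along the ROC curve, which is roughly what performance(..., measure = "auc") computes. A minimal sketch on a hypothetical three-point ROC curve (not our model's actual coordinates):

```r
# Hypothetical ROC coordinates
fpr <- c(0, 0.2, 1)
tpr <- c(0, 0.9, 1)
# Trapezoidal integration of TPR over FPR
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc  # 0.2*0.45 + 0.8*0.95 = 0.85
```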
We put each team's season-average variable values into the model to find out how well it predicts the winning rate.
# 2015-16 season data
s1516<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\15_16_season.csv", header=TRUE)
# 2014-15 east season data
se1415<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\14_15_season_east.csv", header=TRUE)
# 2014-15 west season data
sw1415<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\14_15_season_west.csv", header=TRUE)
# change variable names
names(s1516)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
names(se1415)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
names(sw1415)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s1516$TPA<-s1516$FGA-s1516$THPA
s1516$TPM<-s1516$FGM-s1516$THPM
s1516$TPP<-(s1516$TPM/s1516$TPA)*100
se1415$TPA<-se1415$FGA-se1415$THPA
se1415$TPM<-se1415$FGM-se1415$THPM
se1415$TPP<-(se1415$TPM/se1415$TPA)*100
sw1415$TPA<-sw1415$FGA-sw1415$THPA
sw1415$TPM<-sw1415$FGM-sw1415$THPM
sw1415$TPP<-(sw1415$TPM/sw1415$TPA)*100
# the LA Clippers team name differs in the 15-16 data, so fix it
levels(s1516$team)[13]<-"Los Angeles Clippers"
# split s1516 into East and West
se1516<-s1516[s1516$team %in% se1415$team,]
sw1516<-s1516[s1516$team %in% sw1415$team,]
# re-order by winning percentage
se1415<-arrange(se1415,desc(WP))
sw1415<-arrange(sw1415,desc(WP))
se1516<-arrange(se1516,desc(WP))
sw1516<-arrange(sw1516,desc(WP))
# compare the model's predicted winning rates with the actual results (14-15 East)
pse1415 <- predict(wllr4, newdata=se1415, type="response")
se1415$index<-1:nrow(se1415)
se1415d<-data.frame(se1415$team,as.numeric(se1415$index),se1415$WP,pse1415*100)
names(se1415d)<-c("team","index","WP","P")
plot_ly(data = se1415d, type="scatter", x = ~index, y = ~WP, name="Test", text=~team,
marker = list(size = 10, color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',width = 2))) %>%
layout(title="14-15 East",
yaxis=list(title='Winning Rate'),
xaxis=list(title='Rank')) %>%
add_trace(y = ~P, type="scatter", name = 'Expected', mode = 'markers', text=~team,
marker = list(size = 10, color = 'rgba(172, 206, 249,0.9)',
                          line = list(color = 'rgba(1, 68, 155,0.8)',width = 2)))
# compare the model's predicted winning rates with the actual results (14-15 West)
psw1415<-predict(wllr4,newdata=sw1415,type="response")
sw1415$index<-1:nrow(sw1415)
sw1415d <- data.frame(sw1415$team,as.numeric(sw1415$index),sw1415$WP,psw1415*100)
names(sw1415d)<-c("team","index","WP","P")
plot_ly(data = sw1415d, type="scatter",x = ~index, y = ~WP, name="Test",text=~team,
marker = list(size = 10, color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',width = 2))) %>%
layout(title="14-15 West",
yaxis=list(title='Winning Rate'),
xaxis=list(title='Rank')) %>%
add_trace(y = ~P, type="scatter", name = 'Expected', mode = 'markers', text=~team,
marker = list(size = 10, color = 'rgba(172, 206, 249,0.9)',
                          line = list(color = 'rgba(1, 68, 155,0.8)',width = 2)))
# compare the model's predicted winning rates with the actual results (15-16 East)
pse1516 <- predict(wllr4, newdata=se1516, type="response")
se1516$index<-1:nrow(se1516)
se1516d <- data.frame(se1516$team,as.numeric(se1516$index),se1516$WP,pse1516*100)
names(se1516d)<-c("team","index","WP","P")
plot_ly(data = se1516d, type="scatter", x = ~index, y = ~WP, name="Test",text=~team,
marker = list(size = 10, color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',width = 2))) %>%
layout(title="15-16 East",
yaxis=list(title='Winning Rate'),
xaxis=list(title='Rank')) %>%
add_trace(y = ~P, type="scatter", name = 'Expected', mode = 'markers', text=~team,
marker = list(size = 10, color = 'rgba(172, 206, 249,0.9)',
                          line = list(color = 'rgba(1, 68, 155,0.8)',width = 2)))
# compare the model's predicted winning rates with the actual results (15-16 West)
psw1516<-predict(wllr4,newdata=sw1516,type="response")
sw1516$index<-1:nrow(sw1516)
sw1516d <- data.frame(sw1516$team,as.numeric(sw1516$index),sw1516$WP,psw1516*100)
names(sw1516d)<-c("team","index","WP","P")
plot_ly(data = sw1516d, type="scatter",x = ~index, y = ~WP, name="Test",text=~team,
marker = list(size = 10, color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',width = 2))) %>%
layout(title="15-16 West",
yaxis=list(title='Winning Rate'),
xaxis=list(title='Rank')) %>%
add_trace(y = ~P, type="scatter", name = 'Expected', mode = 'markers', text=~team,
marker = list(size = 10, color = 'rgba(172, 206, 249,0.9)',
                          line = list(color = 'rgba(1, 68, 155,0.8)',width = 2)))
Generally, the model predicted each team's winning rate very well, although some predictions were inaccurate.
We drew a radar plot of a few selected variables to see why these teams' winning rates were poorly predicted.
The model's coefficients reflect the influence of the explanatory variables, but since the variables are on very different scales, the largest coefficient does not necessarily mean the largest influence on the winning rate. We therefore multiplied each coefficient by the mean of its variable to measure influence, and used the five variables with the largest influence in the radar plot.
\[(coefficient) * (mean.of.variables)\]
wllr4$coefficients[c("TPP","DREB","THPP","TOV","PF")]*
  apply(nba[,c("TPP","DREB","THPP","TOV","PF")],2,mean)
## TPP DREB THPP TOV PF
## 11.645877 9.420357 4.631249 -2.923783 -2.336037
TPP, DREB, THPP, TOV, and PF are shown in descending order of absolute influence.
east1415_MH<-se1415[se1415$team=="Miami Heat",]
west1415_MG<-sw1415[sw1415$team=="Memphis Grizzlies",]
east1516_WW<-se1516[se1516$team=="Washington Wizards",]
west1516_UJ<-sw1516[sw1516$team=="Utah Jazz",]
rdiff<-rbind(east1415_MH,west1415_MG,east1516_WW,west1516_UJ)
rdiff$team<-c("14-15 East: Miami Heat", "14-15 West: Memphis Grizzlies",
"15-16 East: Washington Wizards","15-16 West: Utah Jazz")
rdiff.m <- rdiff[,c(names(rdiff) %in% names(wl[,c(18,9,4,14,15)]))]
rdiff.m$team<-rdiff[,1]
rdiff.p<- data.frame(team=rdiff.m[,6],
TPP=round(rdiff.m[,5]^4/max((rdiff.m[,5])^4),2),
DREB=round(rdiff.m[,2]^4/max((rdiff.m[,2])^4),2),
THPP=round(rdiff.m[,1]^4/max(rdiff.m[,1])^4,2),
TOV=round(rdiff.m[,3]^4/max(rdiff.m[,3])^4,2),
PF=round(rdiff.m[,4]^4/max(rdiff.m[,4])^4,2))
rdiff.d <- rdiff.p  # identical table; kept as a separate copy for reshaping below
rdiff.d <- melt(rdiff.d,id=c("team"))
rdiff.d <- rdiff.d[order(rdiff.d$team),]
rdiff.d$degree <- seq(0,288,72) # 5 variables, 72 degrees apart
rdiff.d$o <- rdiff.d$value * sin(rdiff.d$degree * pi / 180) # SOH
rdiff.d$a <- rdiff.d$value * cos(rdiff.d$degree * pi / 180) # CAH
rdiff.d$o100 <- 1 * sin(rdiff.d$degree * pi / 180) # Outer ring x
rdiff.d$a100 <- 1 * cos(rdiff.d$degree * pi / 180) # Outer ring y
p = plot_ly()
for(i in 1:5) {
p <- add_trace(
p,
x = c(rdiff.d$o100[i],0),
y = c(rdiff.d$a100[i],0),
evaluate = TRUE,
line = list(color = "#d3d3d3", dash = "3px"),
showlegend = FALSE
)
}
rdiff.d.rad<-rdiff.d[c(1:20,1,6,11,16),]
p %>%
add_trace(data = rdiff.d.rad,
x = rdiff.d.rad$o,
y = rdiff.d.rad$a,
color = rdiff.d.rad$team,
mode = "lines+markers",
hoverinfo = "text",
text = paste(rdiff.d.rad$team,
rdiff.d.rad$variable,round(rdiff.d.rad$value * 100), "%")) %>%
add_trace(data =rdiff.d, x = rdiff.d$o100, y = rdiff.d$a100,
text = rdiff.d$variable,
hoverinfo = "none",
textposition = "top middle", mode = "lines+text",
line = list(color = "#d3d3d3", dash = "3px", shape = "spline"),
showlegend = FALSE) %>%
layout(xaxis = list(range = c(-1.25,1.25), showticklabels = FALSE, zeroline = FALSE, showgrid = FALSE),
         yaxis = list(range = c(-1.25,1.25), showticklabels = FALSE, zeroline = FALSE, showgrid = FALSE))
knitr::kable(rdiff.p)
| team | TPP | DREB | THPP | TOV | PF |
|---|---|---|---|---|---|
| 14-15 East: Miami Heat | 0.97 | 0.70 | 0.77 | 0.97 | 0.85 |
| 14-15 West: Memphis Grizzlies | 0.87 | 0.92 | 0.80 | 0.63 | 0.71 |
| 15-16 East: Washington Wizards | 1.00 | 1.00 | 1.00 | 0.90 | 1.00 |
| 15-16 West: Utah Jazz | 0.90 | 0.96 | 0.97 | 1.00 | 0.89 |
Although the Miami Heat and Memphis Grizzlies had fewer personal fouls and turnovers than the other two teams, the model expected them to have lower winning rates because their two-point percentage, defensive rebounds, and three-point percentage were smaller. We found that when a team's values on these variables differ greatly from those of other teams, the model's performance can be very poor; outside of such cases, however, it predicts a team's winning rate very well.
Every team wants to make the playoffs. Although only one team can win the championship, reaching the playoffs gives a team invaluable experience. In the middle of the season, people often start predicting which teams will make the playoffs, but there is rarely solid evidence behind their reasoning. Using a procedure similar to the previous analysis, we set out to find the important factors for making the playoffs and built a model that classifies playoff teams.
# 2013-14 season data
s1314<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\13_14_season.csv", header=TRUE)
names(s1314)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s1314$TPA<-s1314$FGA-s1314$THPA
s1314$TPM<-s1314$FGM-s1314$THPM
s1314$TPP<-(s1314$TPM/s1314$TPA)*100
levels(s1314$team)[4]<-"Charlotte Hornets"
se1314<-s1314[s1314$team %in% se1516$team,]
sw1314<-s1314[s1314$team %in% sw1516$team,]
se1314<-arrange(se1314,desc(WP))
sw1314<-arrange(sw1314,desc(WP))
# 2012-13 season data
s1213<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\12_13_season.csv", header=TRUE)
names(s1213)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s1213$TPA<-s1213$FGA-s1213$THPA
s1213$TPM<-s1213$FGM-s1213$THPM
s1213$TPP<-(s1213$TPM/s1213$TPA)*100
levels(s1213$team)[4]<-"Charlotte Hornets"
levels(s1213$team)[19]<-"New Orleans Pelicans"
se1213<-s1213[s1213$team %in% se1516$team,]
sw1213<-s1213[s1213$team %in% sw1516$team,]
se1213<-arrange(se1213,desc(WP))
sw1213<-arrange(sw1213,desc(WP))
# 2011-12 season data
s1112<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\11_12_season.csv", header=TRUE)
names(s1112)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s1112$TPA<-s1112$FGA-s1112$THPA
s1112$TPM<-s1112$FGM-s1112$THPM
s1112$TPP<-(s1112$TPM/s1112$TPA)*100
levels(s1112$team)[3]<-"Charlotte Hornets"
levels(s1112$team)[18]<-"Brooklyn Nets"
levels(s1112$team)[19]<-"New Orleans Pelicans"
se1112<-s1112[s1112$team %in% se1516$team,]
sw1112<-s1112[s1112$team %in% sw1516$team,]
se1112<-arrange(se1112,desc(WP))
sw1112<-arrange(sw1112,desc(WP))
# 2010-11 season data
s1011<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\10_11_season.csv", header=TRUE)
names(s1011)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s1011$TPA<-s1011$FGA-s1011$THPA
s1011$TPM<-s1011$FGM-s1011$THPM
s1011$TPP<-(s1011$TPM/s1011$TPA)*100
levels(s1011$team)[3]<-"Charlotte Hornets"
levels(s1011$team)[18]<-"Brooklyn Nets"
levels(s1011$team)[19]<-"New Orleans Pelicans"
se1011<-s1011[s1011$team %in% se1516$team,]
sw1011<-s1011[s1011$team %in% sw1516$team,]
se1011<-arrange(se1011,desc(WP))
sw1011<-arrange(sw1011,desc(WP))
# 2009-10 season data
s0910<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\09_10_season.csv", header=TRUE)
names(s0910)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s0910$TPA<-s0910$FGA-s0910$THPA
s0910$TPM<-s0910$FGM-s0910$THPM
s0910$TPP<-(s0910$TPM/s0910$TPA)*100
levels(s0910$team)[3]<-"Charlotte Hornets"
levels(s0910$team)[18]<-"Brooklyn Nets"
levels(s0910$team)[19]<-"New Orleans Pelicans"
se0910<-s0910[s0910$team %in% se1516$team,]
sw0910<-s0910[s0910$team %in% sw1516$team,]
se0910<-arrange(se0910,desc(WP))
sw0910<-arrange(sw0910,desc(WP))
# 2008-09 season data
s0809<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\08_09_season.csv", header=TRUE)
names(s0809)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s0809$TPA<-s0809$FGA-s0809$THPA
s0809$TPM<-s0809$FGM-s0809$THPM
s0809$TPP<-(s0809$TPM/s0809$TPA)*100
levels(s0809$team)[3]<-"Charlotte Hornets"
levels(s0809$team)[18]<-"Brooklyn Nets"
levels(s0809$team)[19]<-"New Orleans Pelicans"
se0809<-s0809[s0809$team %in% se1516$team,]
sw0809<-s0809[s0809$team %in% sw1516$team,]
se0809<-arrange(se0809,desc(WP))
sw0809<-arrange(sw0809,desc(WP))
# 2007-08 season data
s0708<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\07_08_season.csv", header=TRUE)
names(s0708)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s0708$TPA<-s0708$FGA-s0708$THPA
s0708$TPM<-s0708$FGM-s0708$THPM
s0708$TPP<-(s0708$TPM/s0708$TPA)*100
levels(s0708$team)[3]<-"Charlotte Hornets"
levels(s0708$team)[18]<-"Brooklyn Nets"
levels(s0708$team)[19]<-"New Orleans Pelicans"
levels(s0708$team)[27]<-"Oklahoma City Thunder"
se0708<-s0708[s0708$team %in% se1516$team,]
sw0708<-s0708[s0708$team %in% sw1516$team,]
se0708<-arrange(se0708,desc(WP))
sw0708<-arrange(sw0708,desc(WP))
# 2006-07 season data
s0607<-read.csv("C:\\Users\\jo\\Desktop\\1st 논문\\06_07_season.csv", header=TRUE)
names(s0607)<-c("team","GP","W","L","WP","MIN","FGM","FGA","FGP","THPM","THPA","THPP",
"FTM","FTA","FTP","OREB","DREB","REB","AST","TOV","STL","BLK","BLKA",
"PF","PFD","points","PL")
s0607$TPA<-s0607$FGA-s0607$THPA
s0607$TPM<-s0607$FGM-s0607$THPM
s0607$TPP<-(s0607$TPM/s0607$TPA)*100
levels(s0607$team)[3]<-"Charlotte Hornets"
levels(s0607$team)[18]<-"Brooklyn Nets"
levels(s0607$team)[19]<-"New Orleans Pelicans"
levels(s0607$team)[27]<-"Oklahoma City Thunder"
se0607<-s0607[s0607$team %in% se1516$team,]
sw0607<-s0607[s0607$team %in% sw1516$team,]
se0607<-arrange(se0607,desc(WP))
sw0607<-arrange(sw0607,desc(WP))
# keep only the columns needed for the analysis
se1516<-se1516[,-c(1:4,6:9,26,27,31)]
sw1516<-sw1516[,-c(1:4,6:9,26,27,31)]
se1415<-se1415[,-c(1:4,6:9,26,27,31)]
sw1415<-sw1415[,-c(1:4,6:9,26,27,31)]
se1314<-se1314[,-c(1:4,6:9,26,27)]
sw1314<-sw1314[,-c(1:4,6:9,26,27)]
se1213<-se1213[,-c(1:4,6:9,26,27)]
sw1213<-sw1213[,-c(1:4,6:9,26,27)]
se1112<-se1112[,-c(1:4,6:9,26,27)]
sw1112<-sw1112[,-c(1:4,6:9,26,27)]
se1011<-se1011[,-c(1:4,6:9,26,27)]
sw1011<-sw1011[,-c(1:4,6:9,26,27)]
se0910<-se0910[,-c(1:4,6:9,26,27)]
sw0910<-sw0910[,-c(1:4,6:9,26,27)]
se0809<-se0809[,-c(1:4,6:9,26,27)]
sw0809<-sw0809[,-c(1:4,6:9,26,27)]
se0708<-se0708[,-c(1:4,6:9,26,27)]
sw0708<-sw0708[,-c(1:4,6:9,26,27)]
se0607<-se0607[,-c(1:4,6:9,26,27)]
sw0607<-sw0607[,-c(1:4,6:9,26,27)]
# create the playoff qualification variable (PO): with each conference sorted
# by winning percentage, the top eight teams make the playoffs
add_po <- function(d) {
  d$PO <- ifelse(d$WP >= d$WP[8], "P", "F")
  d
}
se1516 <- add_po(se1516); sw1516 <- add_po(sw1516)
se1415 <- add_po(se1415); sw1415 <- add_po(sw1415)
se1314 <- add_po(se1314); sw1314 <- add_po(sw1314)
se1213 <- add_po(se1213); sw1213 <- add_po(sw1213)
se1112 <- add_po(se1112); sw1112 <- add_po(sw1112)
se1011 <- add_po(se1011); sw1011 <- add_po(sw1011)
se0910 <- add_po(se0910); sw0910 <- add_po(sw0910)
se0809 <- add_po(se0809); sw0809 <- add_po(sw0809)
se0708 <- add_po(se0708); sw0708 <- add_po(sw0708)
se0607 <- add_po(se0607); sw0607 <- add_po(sw0607)
# merge
gpo<-rbind(se1516,sw1516,se1415,sw1415,se1314,sw1314,se1213,sw1213,
se1112,sw1112,se1011,sw1011,se0910,sw0910,se0809,sw0809,
se0708,sw0708,se0607,sw0607)
gpo$PO<-factor(gpo$PO)
gpo<-gpo[,-1]
# quick view of data
knitr::kable(head(gpo,n=3))
| THPM | THPA | THPP | FTM | FTA | FTP | OREB | DREB | REB | AST | TOV | STL | BLK | BLKA | PF | PFD | TPA | TPM | TPP | PO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10.7 | 29.6 | 36.2 | 16.3 | 21.7 | 74.8 | 10.6 | 33.9 | 44.5 | 22.7 | 13.6 | 6.7 | 3.9 | 4.4 | 20.3 | 20.6 | 54.4 | 28.0 | 51.47059 | P |
| 8.6 | 23.3 | 37.0 | 20.8 | 26.7 | 77.7 | 10.2 | 33.2 | 43.4 | 18.7 | 13.1 | 7.8 | 5.5 | 5.4 | 19.6 | 22.0 | 58.0 | 28.1 | 48.44828 | P |
| 9.9 | 28.4 | 35.0 | 15.6 | 20.0 | 78.3 | 8.3 | 33.8 | 42.1 | 25.6 | 15.0 | 9.1 | 5.9 | 5.0 | 19.1 | 18.3 | 56.0 | 28.7 | 51.25000 | P |
str(gpo)
## 'data.frame': 300 obs. of 20 variables:
## $ THPM: num 10.7 8.6 9.9 8.7 10.6 6.1 8.1 9 7.9 8.6 ...
## $ THPA: num 29.6 23.3 28.4 26.1 29.4 18 23 26.2 21.4 24.2 ...
## $ THPP: num 36.2 37 35 33.5 36.2 33.6 35.1 34.5 37.1 35.8 ...
## $ FTM : num 16.3 20.8 15.6 18.5 18.7 17.1 17.4 17.1 16.5 16.5 ...
## $ FTA : num 21.7 26.7 20 23.5 23.7 23 22.8 25.5 21 22.5 ...
## $ FTP : num 74.8 77.7 78.3 78.8 79 74.5 76.4 66.8 78.7 73 ...
## $ OREB: num 10.6 10.2 8.3 11.6 9 9.8 10.3 12.5 11.1 9.1 ...
## $ DREB: num 33.9 33.2 33.8 33.3 35 34.3 33.9 33.9 35.2 32.8 ...
## $ REB : num 44.5 43.4 42.1 44.9 43.9 44.1 44.2 46.3 46.3 41.8 ...
## $ AST : num 22.7 18.7 25.6 24.2 21.7 20.8 21.2 19.4 22.8 24.5 ...
## $ TOV : num 13.6 13.1 15 13.7 12.5 14.1 14.9 13.5 13.9 14.5 ...
## $ STL : num 6.7 7.8 9.1 9.2 7.3 6.7 9 7 6 8.6 ...
## $ BLK : num 3.9 5.5 5.9 4.2 5.3 6.5 4.8 3.7 5.7 3.9 ...
## $ BLKA: num 4.4 5.4 5 5.5 5.5 4.1 4.5 4.5 5.7 4.3 ...
## $ PF : num 20.3 19.6 19.1 21.9 18.1 18.3 20 19 18.8 20.8 ...
## $ PFD : num 20.6 22 18.3 21 20.4 19.6 20.4 21.6 18.7 20.1 ...
## $ TPA : num 54.4 58 56 63.1 55 63.7 62.2 60.2 66 61.6 ...
## $ TPM : num 28 28.1 28.7 30.5 26.4 32.3 30.2 28.9 30.7 30.9 ...
## $ TPP : num 51.5 48.4 51.2 48.3 48 ...
## $ PO : Factor w/ 2 levels "F","P": 2 2 2 2 2 2 2 2 1 1 ...
These are the variable names and class types after data handling, along with the first few observations. In total, there are 300 observations and 20 variables. A brief description of the variables follows:
| Variable Name | Description | Variable Name | Description |
|---|---|---|---|
| PO | Playoff or Failure | OREB | Offensive Rebound |
| TPM | 2 Points Made | DREB | Defensive Rebound |
| TPA | 2 Points Attempted | REB | Total Rebound |
| TPP | 2 Points Percentage | AST | Assist |
| THPM | Three Points Made | TOV | Turn Over |
| THPA | Three Points Attempted | STL | Steal |
| THPP | Three Points Percentage | BLK | Block |
| FTM | Free Throw Made | BLKA | Blocked Shots |
| FTA | Free Throw Attempted | PF | Personal Foul |
| FTP | Free Throw Percentage | PFD | Personal Foul Drawn |
#train and test set
set.seed(0704)
ind<-sample(1:nrow(gpo),nrow(gpo)/5,replace=F)
traingpo<-gpo[-ind,]
testgpo<-gpo[ind,]
We produced training and test sets by random sampling, holding out one fifth of the observations as the test set.
#random forest
set.seed(0704)
gporf<-randomForest(PO~ .,data=traingpo,ntree=300,proximity=TRUE)
print(gporf)
##
## Call:
## randomForest(formula = PO ~ ., data = traingpo, ntree = 300, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 21.25%
## Confusion matrix:
## F P class.error
## F 76 29 0.2761905
## P 22 113 0.1629630
varImpPlot(gporf)
This is the result of the random forest. The error rate is 21.25%.
TPP showed the largest importance. Since TPP was also the most important factor in winning a single game, this seems reasonable.
TPP was followed by DREB and TOV. As we mentioned in the first analysis, rebounding reflects making fewer mistakes, and turnovers are the typical variable for how often mistakes are made. In short, the variables representing mistakes had a large influence on making the playoffs.
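Beyond the plot, the importance scores can be read numerically with `importance()`. A minimal sketch on hypothetical two-class toy data (the data frame `toy` below is an assumption, standing in for `traingpo`):

```r
library(randomForest)

set.seed(0704)
# Hypothetical toy data standing in for traingpo: a two-class factor target
toy <- data.frame(PO  = factor(rep(c("F", "P"), each = 50)),
                  TPP = c(rnorm(50, 47.6, 2), rnorm(50, 49.4, 2)),
                  TOV = c(rnorm(50, 14.8, 1), rnorm(50, 14.1, 1)))
rf  <- randomForest(PO ~ ., data = toy, ntree = 300)

# Mean decrease in Gini impurity, sorted so the most important comes first
imp <- importance(rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
```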
# Test with the held-out data
gpoPred <- predict(gporf, newdata=testgpo)
gpotesttable<-table(gpoPred, testgpo$PO)
list("confusion matrix"=gpotesttable,"accuracy"=acc(table(gpoPred, testgpo$PO)),
     "error"=err(table(gpoPred, testgpo$PO)))
## $`confusion matrix`
##
## gpoPred F P
## F 19 2
## P 14 25
##
## $accuracy
## [1] 0.7333333
##
## $error
## [1] 0.2666667
This is the result of the random forest applied to the test data. The error rate is 26.7%, slightly higher than the OOB estimate of 21.25% above.
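The accuracy and error reported above follow directly from the confusion matrix; a minimal sketch of the arithmetic:

```r
# Accuracy = correct predictions / all predictions, read off the
# confusion matrix above (rows: predicted, columns: actual)
cm <- matrix(c(19, 14, 2, 25), nrow = 2,
             dimnames = list(gpoPred = c("F", "P"), actual = c("F", "P")))
accuracy <- sum(diag(cm)) / sum(cm)   # (19 + 25) / 60
error    <- 1 - accuracy
round(c(accuracy = accuracy, error = error), 7)
```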
gpoc<-gpo[,c(18,17,19,1:10,12:16,11)]
iplotCorr(gpoc, chartOpts=list(cortitle="Playoff", scattitle="Scatterplot"))
The pairs (TPM, TPA), (THPM, TPA), (THPA, TPA), (THPA, THPM), (FTA, FTM), (FTM, PFD), (FTA, PFD), and (REB, DREB) showed absolute correlation coefficients above 0.6. In the variable importance results from the random forest above, TPA showed greater importance than TPM, THPM, and THPA, so we kept only TPA from that group. Likewise, DREB showed greater importance than REB, and PFD showed greater importance than FTM and FTA, so we removed REB, FTM, and FTA.
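The screening rule applied here can be sketched programmatically: list every variable pair whose absolute correlation exceeds 0.6, then keep only the more important member of each pair. A self-contained illustration on the built-in `mtcars` data (an assumption; the real screening used `gpoc`):

```r
# Sketch of the screening rule: list variable pairs whose absolute
# correlation exceeds 0.6 (demonstrated on built-in mtcars data)
cm   <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
high <- which(abs(cm) > 0.6 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           r    = round(cm[high], 2))
```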
We used LDA instead of logistic regression because logistic regression becomes unstable when the classes are well separated and when there are few observations. Although some of the explanatory variables did not satisfy the normality assumption, we chose LDA for better classification.
PO is used as target variable because the analysis is to classify teams that made playoffs. TPP, DREB, TOV, THPP, TPA, BLKA, BLK, AST, PF, OREB, STL, PFD, and FTP were chosen as explanatory variables.
ldagpo<-lda(PO~TPP+DREB+TOV+THPP+TPA+BLKA+BLK+AST+PF+OREB+STL+
PFD+FTP,data=traingpo)
ldagpo
## Call:
## lda(PO ~ TPP + DREB + TOV + THPP + TPA + BLKA + BLK + AST + PF +
## OREB + STL + PFD + FTP, data = traingpo)
##
## Prior probabilities of groups:
## F P
## 0.4375 0.5625
##
## Group means:
## TPP DREB TOV THPP TPA BLKA BLK AST
## F 47.63082 30.34762 14.79619 34.80190 63.67238 5.108571 4.660952 20.99524
## P 49.43702 31.75037 14.13407 36.16741 61.25481 4.647407 5.013333 22.11333
## PF OREB STL PFD FTP
## F 21.09524 11.22190 7.394286 20.42667 75.54
## P 20.33259 10.79333 7.625926 20.79556 75.98
##
## Coefficients of linear discriminants:
## LD1
## TPP 0.18470909
## DREB 0.27101461
## TOV -0.61307000
## THPP 0.11809325
## TPA -0.03471521
## BLKA -0.20815296
## BLK 0.40362758
## AST 0.04709369
## PF -0.20485638
## OREB 0.20007467
## STL 0.32509012
## PFD 0.32247847
## FTP 0.01473884
\[D = 20.77+0.18·TPP+0.27·DREB-0.61·TOV+0.12·THPP-0.03·TPA-0.21·BLKA+\] \[0.40·BLK+0.05·AST-0.20·PF+0.20·OREB+0.33·STL+0.32·PFD+0.01·FTP\]
These are the results of the LDA. TOV showed the largest absolute coefficient, followed by BLK, STL, and PFD. Again, the variables representing mistakes showed relatively large influence.
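The LD1 score is simply this linear combination of a team's season averages. A minimal sketch using three coefficients and the group means copied from the `lda()` output above (restricted to three variables for brevity):

```r
# Sketch: the LD1 score is the dot product of a team's season averages
# with the scaling coefficients (three variables shown for brevity;
# coefficients and group means copied from the lda() output above)
coefs  <- c(TPP = 0.18470909, DREB = 0.27101461, TOV = -0.61307000)
mean_P <- c(TPP = 49.43702,   DREB = 31.75037,   TOV = 14.13407)
mean_F <- c(TPP = 47.63082,   DREB = 30.34762,   TOV = 14.79619)
sum(coefs * mean_P) - sum(coefs * mean_F)  # positive: P group scores higher
```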
# confusion matrix
gporesults2 <- predict(ldagpo,newdata=testgpo)
gpotab <- table("Actual" = testgpo$PO,
"Predicted" = gporesults2$class)
list("confusion matrix"=gpotab, accuracy=acc(gpotab),error=err(gpotab))
## $`confusion matrix`
## Predicted
## Actual F P
## F 22 11
## P 2 25
##
## $accuracy
## [1] 0.7833333
##
## $error
## [1] 0.2166667
The model’s accuracy was 78.3%. To be specific, most of the teams that made the playoffs were predicted correctly (25 of 27, about 93%), while teams that failed to reach the playoffs were predicted less accurately (22 of 33, about 67%).
# auc
pgpo <- predict(ldagpo, newdata=testgpo, type="response")
prgpo <- prediction(pgpo$posterior[,2], testgpo$PO)
prfgpo <- performance(prgpo, measure = "tpr", x.measure = "fpr")
plot(prfgpo,main="ROC Curve of LDA")
abline(a=0,b=1,col=2,lwd=3)
aucgpo <- performance(prgpo, measure = "auc")
aucgpo <- aucgpo@y.values[[1]]
aucgpo
## [1] 0.8675645
We drew a Receiver Operating Characteristic (ROC) curve to evaluate the model. The Area Under the Curve (AUC) was 0.87, indicating that the LDA classified the data very well.
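The AUC has a convenient interpretation: it is the probability that a randomly chosen playoff team receives a higher posterior score than a randomly chosen non-playoff team. A minimal sketch of that rank-based computation on hypothetical scores:

```r
# Sketch: AUC as the probability that a random positive outscores a
# random negative (ties count half); scores and labels are hypothetical
scores <- c(0.9, 0.8, 0.35, 0.4, 0.2)
labels <- c(1,   1,   1,    0,   0)   # 1 = made playoffs, 0 = failed
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 5 of the 6 positive/negative pairs are correctly ordered
```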
gporesults1 <- predict(ldagpo)
ldahist(data = gporesults1$x[,1], g=traingpo$PO,col="lightblue",nbins=25)
This is a histogram of the discriminant function values for the training set. The two groups appear to be discriminated quite well.
ldahist(data = gporesults2$x[,1], g=testgpo$PO,col="tomato",nbins=15)
This is a histogram of the discriminant function values for the test set. They also appear to be discriminated well.
ldaplot <- data.frame(index=c(1:length(gporesults2$x)), x=gporesults2$x[,1],
pre.po=gporesults2$class, test.po=testgpo$PO)
plot_ly(data = ldaplot, type="scatter", x = ~index, y = ~x,
symbol = ~pre.po,symbols=c('x','circle'),colors=c('blue','orange'),
marker=list(size=10)) %>%
add_trace(y=~x, x=~index, type="scatter", symbol=~test.po, symbols=c('x','circle'),
colors=c('green','red'), marker=list(size=12)) %>%
layout(title="Playoff or Not",
yaxis=list(range = c(-3,3), zeroline = FALSE,title='LDA value'),
         xaxis=list(title='Index'))
The blue "X" and orange circle markers are the predicted classes; the green "X" and red circle markers are the actual classes from the test set. Even though some points are not classified correctly, it seems fair to say that the LDA discriminated well in general.
To draw the radar plot, we multiplied the coefficients of the discriminant function by the mean of each variable, and put the five variables with the largest absolute products into the radar plot.
ldagpo$scaling[c("TPP","TOV","DREB","PFD","PF"),]*
  apply(gpo[,c("TPP","TOV","DREB","PFD","PF")],2,mean)
## TPP TOV DREB PFD PF
## 8.981049 -8.845987 8.444454 6.656708 -4.229465
TPP, TOV, DREB, PFD, and PF were chosen, in descending order of absolute influence.
po.m <- traingpo[,c(19,11,8,16,15,20)]
po.p<-po.m[po.m$PO=="P",1:5]
po.f<-po.m[po.m$PO=="F",1:5]
po.p<-apply(po.p,2,mean)
po.f<-apply(po.f,2,mean)
po.m<-rbind(po.p,po.f)
row.names(po.m)<-c("P","F")
po.d<- data.frame(MakePO=row.names(po.m),
TPP=round(po.m[,1]^4/max((po.m[,1])^4),2),
TOV=round(po.m[,2]^4/max(po.m[,2])^4,2),
DREB=round(po.m[,3]^4/max((po.m[,3])^4),2),
PFD=round(po.m[,4]^4/max(po.m[,4])^4,2),
PF=round(po.m[,5]^4/max(po.m[,5])^4,2))
po.d <- melt(po.d,id=c("MakePO"))
po.d <- po.d[order(po.d$MakePO),]
po.d$degree <- seq(0,288,72) # 5 variables, spaced 72 degrees apart
po.d$o <- po.d$value * sin(po.d$degree * pi / 180) # SOH
po.d$a <- po.d$value * cos(po.d$degree * pi / 180) # CAH
po.d$o100 <- 1 * sin(po.d$degree * pi / 180) # Outer ring x
po.d$a100 <- 1 * cos(po.d$degree * pi / 180) # Outer ring y
p = plot_ly()
for(i in 1:5) {
p <- add_trace(
p,
x = c(po.d$o100[i],0),
y = c(po.d$a100[i],0),
evaluate = TRUE,
line = list(color = "#d3d3d3", dash = "3px"),
showlegend = FALSE
)
}
po.d.rad<-po.d[c(1:10,1,6),]
p %>%
add_trace(data = po.d.rad,
x = po.d.rad$o, y = po.d.rad$a, color = po.d.rad$MakePO,
mode = "lines+markers",
hoverinfo = "text",
text = paste(po.d.rad$MakePO, po.d.rad$variable,round(po.d.rad$value * 100), "%")) %>%
add_trace(data = po.d, x = po.d$o100, y = po.d$a100,
text = po.d$variable,
hoverinfo = "none",
textposition = "top middle", mode = "lines+text",
line = list(color = "#d3d3d3", dash = "3px", shape = "spline"),
showlegend = FALSE) %>%
layout(xaxis = list(range = c(-1.25,1.25), showticklabels = FALSE, zeroline = FALSE, showgrid = FALSE),
       yaxis = list(range = c(-1.25,1.25), showticklabels = FALSE, zeroline = FALSE, showgrid = FALSE))
Playoff teams committed fewer personal fouls (PF) and turnovers (TOV) than non-playoff teams, while their TPP, DREB, and PFD were greater. This can be interpreted as playoff teams making fewer mistakes, scoring more efficiently, and drawing more mistakes from their opponents.
In winning individual games, scoring was the most important factor of all. Making fewer mistakes was also important, but scoring mattered much more.
In making the playoffs, making fewer mistakes was the most important factor.
Teams with great players are more likely to score more, but acquiring those players requires an enormous amount of money. Teams with poor teamwork, meanwhile, usually make a lot of mistakes. Building great teamwork costs far less and is more likely to achieve the long-term goal of the season. Nowadays, a great deal of money is spent in professional sports, and team owners often assume that money will easily bring them a championship. As this analysis shows, it is not that simple. Teams should put more effort into building chemistry to accomplish their long-term goals rather than simply buying expensive players.
* Writer: Sung-In Cho, Yoon-Seo Jang
* Creation date: Dec 16, 2016